Load Libraries

Load Data

Exploratory Data Analysis

Outliers

Our dataset does not have any missing values.
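A check like this can be reproduced with a short helper. The sketch below is only an illustration using the Python standard library: the function name `count_missing` and the sample rows are hypothetical and are not taken from the dataset.

```python
import math

def count_missing(rows):
    """Count missing entries (None, empty string, or NaN) per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = (
                value is None
                or value == ""
                or (isinstance(value, float) and math.isnan(value))
            )
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts

# Illustrative rows only; these values are NOT from the wine dataset.
sample_rows = [
    {"alcohol": 9.4, "density": 0.9978, "quality": 5},
    {"alcohol": 9.8, "density": None, "quality": 6},
    {"alcohol": float("nan"), "density": 0.9970, "quality": 5},
]
missing = count_missing(sample_rows)
```

A dataset without missing values would yield a count of zero for every column.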

Our response variable "Quality" is an ordinal categorical variable with ranks from 1 to 10. All predictors are numerical, and the range of values (min/max) across the variables does not require scaling or normalization.

Covariance Matrix

There is significant multicollinearity between several pairs of variables:

density - alcohol
total.sulfur.dioxide - free.sulfur.dioxide
density - residual.sugar
residual.sugar - total.sulfur.dioxide
density - fixed.acidity
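One way to surface such pairs programmatically is sketched below. It uses the Pearson correlation (covariance scaled by the two standard deviations) rather than raw covariance, so that variables on different scales are comparable. The 0.7 threshold, the function names, and the toy columns are assumptions made for illustration, not values from this analysis.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    The threshold of 0.7 is an assumed cutoff, not one stated in the text.
    """
    flagged = []
    for a, b in combinations(sorted(columns), 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Toy columns for illustration only; NOT the real measurements.
toy_columns = {
    "density": [0.990, 0.992, 0.994, 0.996, 0.998],
    "alcohol": [12.5, 11.8, 11.0, 10.1, 9.4],
    "pH": [3.2, 3.0, 3.4, 3.1, 3.3],
}
pairs = correlated_pairs(toy_columns)
```

On the toy data only the density and alcohol pair is flagged, with a strongly negative coefficient.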

Certain predictors will have to be removed. Most likely candidates are:

free.sulfur.dioxide
residual.sugar
fixed.acidity
alcohol

Data Preparation

Modeling

We will start by converting "quality" from its 1-10 scale to a binary label (0, 1). A 1 will represent "good" wine and a 0 will represent "bad" wine. We will analyze the original dataset, with the full 1-10 scale, using ordinal logistic regression at the end of this research.
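The conversion can be sketched as follows. The text does not state which scores count as "good", so the cutoff of 7 used here is purely an assumption, as is the function name `binarize_quality`.

```python
def binarize_quality(scores, cutoff=7):
    """Map 1-10 quality scores to 1 ("good") or 0 ("bad").

    The cutoff of 7 is an assumed value; the source does not specify it.
    """
    return [1 if score >= cutoff else 0 for score in scores]

# Illustrative scores only.
labels = binarize_quality([3, 5, 6, 7, 8, 9])
```

Changing `cutoff` shifts the class balance, which in turn affects every classifier evaluated afterwards.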

At first, we will use logistic regression, KNN, SVM, trees, etc.

For logistic regression, we removed the following non-significant predictors:

density
fixed.acidity
residual.sugar

This is well aligned with the conclusions from the covariance matrix.

Logistic Regression

Logistic Regression with All Predictors

Since the reported p-value is 0.0, there are no non-significant coefficients.

Logistic Regression with Significant Predictors

K-Means

KNN

SVM

Ordinal Regression

Random Forest

Decision Tree

Cross validation before pruning

Pruning

Implementation of Monte Carlo CV for: Logistic Regression, KNN, SVM, Ordinal Regression, Random Forest, Decision Tree
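Monte Carlo cross-validation repeatedly draws a random train/test split, fits the model on the training part, scores it on the held-out part, and averages the scores. The sketch below shows that loop in a model-agnostic form. The function names, the 50 repetitions, the 20% test fraction, the toy data, and the majority-class stand-in model are all assumptions for illustration; any of the listed classifiers could be plugged in through `fit` and `predict`.

```python
import random
from statistics import mean

def monte_carlo_cv(features, labels, fit, predict,
                   n_splits=50, test_fraction=0.2, seed=0):
    """Average test accuracy over repeated random train/test splits."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    n_test = max(1, int(round(test_fraction * len(indices))))
    accuracies = []
    for _ in range(n_splits):
        rng.shuffle(indices)  # a fresh random split on every repetition
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        predictions = predict(model, [features[i] for i in test_idx])
        correct = sum(p == labels[i] for p, i in zip(predictions, test_idx))
        accuracies.append(correct / n_test)
    return mean(accuracies), accuracies

# Stand-in model: always predicts the most frequent training label.
def fit_majority(train_x, train_y):
    return max(set(train_y), key=train_y.count)

def predict_majority(model, test_x):
    return [model] * len(test_x)

# Toy data for illustration only.
toy_x = [[float(i)] for i in range(20)]
toy_y = [0] * 14 + [1] * 6
avg_accuracy, split_accuracies = monte_carlo_cv(
    toy_x, toy_y, fit_majority, predict_majority
)
```

Unlike k-fold cross-validation, the test sets of different repetitions may overlap, and the number of repetitions is chosen independently of the test-set size.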

Boosting

All variables

Only significant variables